Part 2: Predictive Analysis - Review Score Prediction¶
Predicting Customer Satisfaction Based on Order Features¶
This notebook builds a supervised machine learning model to predict whether a customer review will be high (4-5 stars) or low (1-3 stars) based on order characteristics.
Business Objective: Enable proactive customer satisfaction management by identifying orders likely to receive poor reviews before the review is submitted.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Machine Learning imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
print("Libraries imported successfully!")
Libraries imported successfully!
Data Loading and Initial Exploration¶
In [2]:
# Load datasets
orders = pd.read_csv('Data/olist_orders_dataset.csv')
order_items = pd.read_csv('Data/olist_order_items_dataset.csv')
customers = pd.read_csv('Data/olist_customers_dataset.csv')
products = pd.read_csv('Data/olist_products_dataset.csv')
payments = pd.read_csv('Data/olist_order_payments_dataset.csv')
reviews = pd.read_csv('Data/olist_order_reviews_dataset.csv')
sellers = pd.read_csv('Data/olist_sellers_dataset.csv')
category_translation = pd.read_csv('Data/product_category_name_translation.csv')
print(f"Reviews dataset shape: {reviews.shape}")
print(f"\nReview score distribution:")
print(reviews['review_score'].value_counts().sort_index())
# Calculate review score statistics
print(f"\nReview score statistics:")
print(f"Mean: {reviews['review_score'].mean():.2f}")
print(f"Median: {reviews['review_score'].median():.1f}")
print(f"Mode: {reviews['review_score'].mode().iloc[0]}")
Reviews dataset shape: (99224, 7) Review score distribution: review_score 1 11424 2 3151 3 8179 4 19142 5 57328 Name: count, dtype: int64 Review score statistics: Mean: 4.09 Median: 5.0 Mode: 5
Feature Engineering and Data Preparation¶
In [3]:
# Create target variable: high review (4-5) vs low review (1-3)
reviews['is_high_review'] = (reviews['review_score'] >= 4).astype(int)
print("Target variable distribution:")
target_dist = reviews['is_high_review'].value_counts()
print(f"Low reviews (1-3): {target_dist[0]} ({target_dist[0]/len(reviews)*100:.1f}%)")
print(f"High reviews (4-5): {target_dist[1]} ({target_dist[1]/len(reviews)*100:.1f}%)")
# Check for class imbalance
class_ratio = target_dist[1] / target_dist[0]
print(f"\nClass ratio (High:Low): {class_ratio:.2f}:1")
if class_ratio > 2 or class_ratio < 0.5:
print("⚠️ Significant class imbalance detected - will need to address in modeling")
else:
print("✅ Relatively balanced classes")
Target variable distribution: Low reviews (1-3): 22754 (22.9%) High reviews (4-5): 76470 (77.1%) Class ratio (High:Low): 3.36:1 ⚠️ Significant class imbalance detected - will need to address in modeling
In [4]:
# Merge datasets to create feature set
# Start with reviews and orders
ml_data = reviews[['order_id', 'review_score', 'is_high_review']].merge(
orders, on='order_id', how='inner'
)
# Add order items information
order_items_agg = order_items.groupby('order_id').agg({
'order_item_id': 'count', # number of items
'product_id': 'nunique', # number of unique products
'price': ['sum', 'mean', 'std'],
'freight_value': ['sum', 'mean']
}).round(2)
order_items_agg.columns = [
'total_items', 'unique_products', 'total_price', 'avg_item_price', 'price_std',
'total_freight', 'avg_freight'
]
order_items_agg['price_std'] = order_items_agg['price_std'].fillna(0) # Single item orders
ml_data = ml_data.merge(order_items_agg, on='order_id', how='left')
# Add payment information
payments_agg = payments.groupby('order_id').agg({
'payment_sequential': 'count', # number of payment methods
'payment_type': lambda x: x.mode().iloc[0] if len(x) > 0 else 'unknown', # most common payment type
'payment_installments': 'mean',
'payment_value': 'sum'
}).round(2)
payments_agg.columns = ['payment_methods_count', 'primary_payment_type', 'avg_installments', 'total_payment']
ml_data = ml_data.merge(payments_agg, on='order_id', how='left')
# Add customer information
ml_data = ml_data.merge(customers, on='customer_id', how='left')
print(f"ML dataset shape after merging: {ml_data.shape}")
print(f"Missing values summary:")
print(ml_data.isnull().sum().sort_values(ascending=False).head(10))
ML dataset shape after merging: (99224, 25) Missing values summary: order_delivered_customer_date 2865 order_delivered_carrier_date 1756 total_freight 759 price_std 759 avg_item_price 759 unique_products 759 total_items 759 total_price 759 avg_freight 759 order_approved_at 156 dtype: int64
In [5]:
# Feature Engineering - Create meaningful features from the data
# Convert datetime columns
datetime_cols = ['order_purchase_timestamp', 'order_approved_at',
'order_delivered_carrier_date', 'order_delivered_customer_date',
'order_estimated_delivery_date']
for col in datetime_cols:
ml_data[col] = pd.to_datetime(ml_data[col])
# Delivery performance features
ml_data['delivery_days'] = (ml_data['order_delivered_customer_date'] -
ml_data['order_purchase_timestamp']).dt.days
ml_data['estimated_delivery_days'] = (ml_data['order_estimated_delivery_date'] -
ml_data['order_purchase_timestamp']).dt.days
ml_data['delivery_delay_days'] = (ml_data['order_delivered_customer_date'] -
ml_data['order_estimated_delivery_date']).dt.days
ml_data['is_delivered_late'] = ml_data['delivery_delay_days'] > 0
ml_data['approval_delay_hours'] = (ml_data['order_approved_at'] -
ml_data['order_purchase_timestamp']).dt.total_seconds() / 3600
# Order timing features
ml_data['order_hour'] = ml_data['order_purchase_timestamp'].dt.hour
ml_data['order_dayofweek'] = ml_data['order_purchase_timestamp'].dt.dayofweek
ml_data['order_month'] = ml_data['order_purchase_timestamp'].dt.month
ml_data['is_weekend'] = ml_data['order_dayofweek'].isin([5, 6])
# Price and value features
ml_data['freight_to_price_ratio'] = ml_data['total_freight'] / (ml_data['total_price'] + 0.01) # Avoid division by zero
ml_data['avg_item_value'] = ml_data['total_price'] / ml_data['total_items']
ml_data['is_high_value_order'] = ml_data['total_price'] > ml_data['total_price'].quantile(0.75)
# Order complexity features
ml_data['product_diversity_ratio'] = ml_data['unique_products'] / ml_data['total_items']
ml_data['is_single_item_order'] = ml_data['total_items'] == 1
ml_data['has_multiple_payments'] = ml_data['payment_methods_count'] > 1
print("Feature engineering completed!")
print(f"Dataset shape: {ml_data.shape}")
# Display some key statistics
print(f"\nKey feature statistics:")
print(f"Average delivery days: {ml_data['delivery_days'].mean():.1f}")
print(f"Late delivery rate: {ml_data['is_delivered_late'].mean()*100:.1f}%")
print(f"High value orders: {ml_data['is_high_value_order'].mean()*100:.1f}%")
print(f"Single item orders: {ml_data['is_single_item_order'].mean()*100:.1f}%")
Feature engineering completed! Dataset shape: (99224, 40) Key feature statistics: Average delivery days: 12.1 Late delivery rate: 6.5% High value orders: 24.6% Single item orders: 89.4%
In [6]:
# Select features for modeling
# Only include orders that were delivered (to have complete delivery information)
delivered_data = ml_data[ml_data['order_status'] == 'delivered'].copy()
print(f"Filtering to delivered orders: {len(delivered_data)} records ({len(delivered_data)/len(ml_data)*100:.1f}% of total)")
# Define feature sets
numerical_features = [
'total_items', 'unique_products', 'total_price', 'avg_item_price', 'price_std',
'total_freight', 'avg_freight', 'avg_installments', 'total_payment',
'delivery_days', 'estimated_delivery_days', 'delivery_delay_days',
'approval_delay_hours', 'order_hour', 'order_dayofweek', 'order_month',
'freight_to_price_ratio', 'avg_item_value', 'product_diversity_ratio'
]
categorical_features = [
'order_status', 'primary_payment_type', 'customer_state'
]
boolean_features = [
'is_delivered_late', 'is_weekend', 'is_high_value_order',
'is_single_item_order', 'has_multiple_payments'
]
# Combine all features
all_features = numerical_features + categorical_features + boolean_features
# Check for missing values in selected features
feature_missing = delivered_data[all_features].isnull().sum()
print(f"\nMissing values in selected features:")
print(feature_missing[feature_missing > 0])
# Remove rows with missing target or key features
delivered_data = delivered_data.dropna(subset=['is_high_review'] + numerical_features[:10]) # Keep most important features
print(f"Final dataset shape: {delivered_data.shape}")
print(f"Target distribution in final dataset:")
final_target_dist = delivered_data['is_high_review'].value_counts()
print(f"Low reviews: {final_target_dist[0]} ({final_target_dist[0]/len(delivered_data)*100:.1f}%)")
print(f"High reviews: {final_target_dist[1]} ({final_target_dist[1]/len(delivered_data)*100:.1f}%)")
Filtering to delivered orders: 96361 records (97.1% of total) Missing values in selected features: avg_installments 1 total_payment 1 delivery_days 8 delivery_delay_days 8 approval_delay_hours 14 primary_payment_type 1 dtype: int64 Final dataset shape: (96352, 40) Target distribution in final dataset: Low reviews: 20306 (21.1%) High reviews: 76046 (78.9%)
Exploratory Data Analysis for Model Features¶
In [7]:
# Analyze relationship between key features and target
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Feature Analysis: High vs Low Reviews', fontsize=16)
# 1. Delivery delay impact
delivered_data.boxplot(column='delivery_delay_days', by='is_high_review', ax=axes[0,0])
axes[0,0].set_title('Delivery Delay vs Review Score')
axes[0,0].set_xlabel('Review Score (0=Low, 1=High)')
# 2. Order value impact
delivered_data.boxplot(column='total_price', by='is_high_review', ax=axes[0,1])
axes[0,1].set_title('Order Value vs Review Score')
axes[0,1].set_xlabel('Review Score (0=Low, 1=High)')
# 3. Freight ratio impact
delivered_data.boxplot(column='freight_to_price_ratio', by='is_high_review', ax=axes[0,2])
axes[0,2].set_title('Freight/Price Ratio vs Review Score')
axes[0,2].set_xlabel('Review Score (0=Low, 1=High)')
# 4. Delivery days impact
delivered_data.boxplot(column='delivery_days', by='is_high_review', ax=axes[1,0])
axes[1,0].set_title('Delivery Days vs Review Score')
axes[1,0].set_xlabel('Review Score (0=Low, 1=High)')
# 5. Number of items impact
delivered_data.boxplot(column='total_items', by='is_high_review', ax=axes[1,1])
axes[1,1].set_title('Number of Items vs Review Score')
axes[1,1].set_xlabel('Review Score (0=Low, 1=High)')
# 6. Installments impact
delivered_data.boxplot(column='avg_installments', by='is_high_review', ax=axes[1,2])
axes[1,2].set_title('Average Installments vs Review Score')
axes[1,2].set_xlabel('Review Score (0=Low, 1=High)')
plt.tight_layout()
plt.show()
# Statistical analysis of key differences
print("\nMean differences between high and low review orders:")
comparison_features = ['delivery_delay_days', 'total_price', 'freight_to_price_ratio',
'delivery_days', 'total_items', 'avg_installments']
for feature in comparison_features:
low_mean = delivered_data[delivered_data['is_high_review']==0][feature].mean()
high_mean = delivered_data[delivered_data['is_high_review']==1][feature].mean()
diff_pct = ((high_mean - low_mean) / low_mean * 100) if low_mean != 0 else 0
print(f"{feature}: Low={low_mean:.2f}, High={high_mean:.2f}, Diff={diff_pct:+.1f}%")
Mean differences between high and low review orders: delivery_delay_days: Low=-7.35, High=-13.14, Diff=+78.7% total_price: Low=147.16, High=133.85, Diff=-9.0% freight_to_price_ratio: Low=0.32, High=0.30, Diff=-5.5% delivery_days: Low=17.41, High=10.63, Diff=-38.9% total_items: Low=1.26, High=1.11, Diff=-11.6% avg_installments: Low=3.07, High=2.87, Diff=-6.5%
In [8]:
# Analyze categorical features
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Payment type vs review score
payment_review = pd.crosstab(delivered_data['primary_payment_type'],
delivered_data['is_high_review'], normalize='index')
payment_review.plot(kind='bar', ax=axes[0], title='Payment Type vs Review Score')
axes[0].set_xlabel('Payment Type')
axes[0].set_ylabel('Proportion')
axes[0].legend(['Low Review', 'High Review'])
# Late delivery vs review score
late_review = pd.crosstab(delivered_data['is_delivered_late'],
delivered_data['is_high_review'], normalize='index')
late_review.plot(kind='bar', ax=axes[1], title='Late Delivery vs Review Score')
axes[1].set_xlabel('Delivered Late')
axes[1].set_ylabel('Proportion')
axes[1].legend(['Low Review', 'High Review'])
# Weekend orders vs review score
weekend_review = pd.crosstab(delivered_data['is_weekend'],
delivered_data['is_high_review'], normalize='index')
weekend_review.plot(kind='bar', ax=axes[2], title='Weekend Order vs Review Score')
axes[2].set_xlabel('Weekend Order')
axes[2].set_ylabel('Proportion')
axes[2].legend(['Low Review', 'High Review'])
plt.tight_layout()
plt.show()
# Print key insights
late_delivery_impact = delivered_data.groupby('is_delivered_late')['is_high_review'].mean()
print(f"\nKey Insights:")
print(f"On-time delivery high review rate: {late_delivery_impact[False]*100:.1f}%")
print(f"Late delivery high review rate: {late_delivery_impact[True]*100:.1f}%")
print(f"Late delivery impact: {(late_delivery_impact[False] - late_delivery_impact[True])*100:.1f} percentage point difference")
Key Insights: On-time delivery high review rate: 82.6% Late delivery high review rate: 26.7% Late delivery impact: 55.9 percentage point difference
Model Training and Evaluation¶
In [9]:
# Prepare data for modeling
# Select final feature set (removing features with too many missing values)
final_features = [
# Numerical features
'total_items', 'unique_products', 'total_price', 'avg_item_price',
'total_freight', 'avg_freight', 'avg_installments',
'delivery_days', 'delivery_delay_days', 'order_hour', 'order_month',
'freight_to_price_ratio', 'avg_item_value', 'product_diversity_ratio',
# Boolean features (will be treated as numerical)
'is_delivered_late', 'is_weekend', 'is_high_value_order',
'is_single_item_order', 'has_multiple_payments'
]
# Add important categorical features (encoded)
categorical_for_model = ['primary_payment_type', 'customer_state']
# Prepare feature matrix
X_numerical = delivered_data[final_features].fillna(0)
# Encode categorical features
X_categorical = pd.get_dummies(delivered_data[categorical_for_model], prefix=categorical_for_model)
# Combine features
X = pd.concat([X_numerical, X_categorical], axis=1)
y = delivered_data['is_high_review']
print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Feature names: {X.columns.tolist()[:10]}...") # Show first 10 features
# Check for any remaining missing values
missing_check = X.isnull().sum().sum()
print(f"Missing values in feature matrix: {missing_check}")
if missing_check > 0:
X = X.fillna(0) # Fill any remaining missing values with 0
print("Filled remaining missing values with 0")
Feature matrix shape: (96352, 50) Target vector shape: (96352,) Feature names: ['total_items', 'unique_products', 'total_price', 'avg_item_price', 'total_freight', 'avg_freight', 'avg_installments', 'delivery_days', 'delivery_delay_days', 'order_hour']... Missing values in feature matrix: 0
In [10]:
# Train-Test Split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Check class distribution in splits
print(f"\nTraining set class distribution:")
train_dist = y_train.value_counts()
print(f"Low: {train_dist[0]} ({train_dist[0]/len(y_train)*100:.1f}%)")
print(f"High: {train_dist[1]} ({train_dist[1]/len(y_train)*100:.1f}%)")
print(f"\nTest set class distribution:")
test_dist = y_test.value_counts()
print(f"Low: {test_dist[0]} ({test_dist[0]/len(y_test)*100:.1f}%)")
print(f"High: {test_dist[1]} ({test_dist[1]/len(y_test)*100:.1f}%)")
Training set: 77081 samples Test set: 19271 samples Training set class distribution: Low: 16245 (21.1%) High: 60836 (78.9%) Test set class distribution: Low: 4061 (21.1%) High: 15210 (78.9%)
In [11]:
# Train multiple models and compare performance
models = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
'Logistic Regression': LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000)
}
# Scale features for Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_results = {}
print("Training models...\n")
for name, model in models.items():
print(f"Training {name}...")
# Use scaled data for Logistic Regression, original for tree-based models
if name == 'Logistic Regression':
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
else:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)
model_results[name] = {
'model': model,
'predictions': y_pred,
'probabilities': y_pred_proba,
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': f1,
'auc': auc
}
print(f" Accuracy: {accuracy:.4f}")
print(f" Precision: {precision:.4f}")
print(f" Recall: {recall:.4f}")
print(f" F1-Score: {f1:.4f}")
print(f" AUC-ROC: {auc:.4f}")
print()
# Find best model based on F1-score (good balance for slightly imbalanced data)
best_model_name = max(model_results.keys(), key=lambda k: model_results[k]['f1'])
print(f"Best performing model: {best_model_name} (F1-Score: {model_results[best_model_name]['f1']:.4f})")
Training models... Training Random Forest... Accuracy: 0.8207 Precision: 0.8284 Recall: 0.9748 F1-Score: 0.8956 AUC-ROC: 0.6887 Training Gradient Boosting... Accuracy: 0.8239 Precision: 0.8265 Recall: 0.9834 F1-Score: 0.8981 AUC-ROC: 0.7056 Training Logistic Regression... Accuracy: 0.7553 Precision: 0.8599 Recall: 0.8243 F1-Score: 0.8417 AUC-ROC: 0.7035 Best performing model: Gradient Boosting (F1-Score: 0.8981)
In [12]:
# Detailed evaluation of the best model
best_model = model_results[best_model_name]
best_predictions = best_model['predictions']
best_probabilities = best_model['probabilities']
# Confusion Matrix
cm = confusion_matrix(y_test, best_predictions)
plt.figure(figsize=(12, 5))
# Plot 1: Confusion Matrix
plt.subplot(1, 2, 1)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Low Review', 'High Review'],
yticklabels=['Low Review', 'High Review'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('Actual')
plt.xlabel('Predicted')
# Plot 2: ROC Curve
plt.subplot(1, 2, 2)
fpr, tpr, _ = roc_curve(y_test, best_probabilities)
plt.plot(fpr, tpr, linewidth=2, label=f'{best_model_name} (AUC = {best_model["auc"]:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Detailed classification report
print(f"\nDetailed Classification Report for {best_model_name}:")
print("=" * 50)
print(classification_report(y_test, best_predictions,
target_names=['Low Review (1-3)', 'High Review (4-5)']))
# Calculate additional insights
tn, fp, fn, tp = cm.ravel()
specificity = tn / (tn + fp)
npv = tn / (tn + fn) # Negative Predictive Value
print(f"\nAdditional Metrics:")
print(f"Specificity (True Negative Rate): {specificity:.4f}")
print(f"Negative Predictive Value: {npv:.4f}")
print(f"False Positive Rate: {fp/(fp+tn):.4f}")
print(f"False Negative Rate: {fn/(fn+tp):.4f}")
Detailed Classification Report for Gradient Boosting:
==================================================
precision recall f1-score support
Low Review (1-3) 0.78 0.23 0.35 4061
High Review (4-5) 0.83 0.98 0.90 15210
accuracy 0.82 19271
macro avg 0.81 0.61 0.62 19271
weighted avg 0.82 0.82 0.78 19271
Additional Metrics:
Specificity (True Negative Rate): 0.2268
Negative Predictive Value: 0.7845
False Positive Rate: 0.7732
False Negative Rate: 0.0166
Feature Importance Analysis¶
In [13]:
# Feature importance analysis (works best with tree-based models)
if best_model_name in ['Random Forest', 'Gradient Boosting']:
# Get feature importances
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': best_model['model'].feature_importances_
}).sort_values('importance', ascending=False)
# Plot top 20 features
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(20)
plt.barh(range(len(top_features)), top_features['importance'], color='skyblue')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance')
plt.title(f'Top 20 Feature Importances - {best_model_name}')
plt.gca().invert_yaxis()
# Add value labels
for i, v in enumerate(top_features['importance']):
plt.text(v + 0.001, i, f'{v:.3f}', va='center')
plt.tight_layout()
plt.show()
print(f"\nTop 10 Most Important Features for {best_model_name}:")
print("=" * 60)
for i, (_, row) in enumerate(feature_importance.head(10).iterrows(), 1):
print(f"{i:2d}. {row['feature']:<25} {row['importance']:.4f}")
else:
# For Logistic Regression, show coefficient magnitudes
feature_coefs = pd.DataFrame({
'feature': X.columns,
'coefficient': abs(best_model['model'].coef_[0])
}).sort_values('coefficient', ascending=False)
plt.figure(figsize=(12, 8))
top_features = feature_coefs.head(20)
plt.barh(range(len(top_features)), top_features['coefficient'], color='lightcoral')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Absolute Coefficient Value')
plt.title(f'Top 20 Feature Coefficients (Absolute) - {best_model_name}')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print(f"\nTop 10 Most Important Features for {best_model_name}:")
print("=" * 60)
for i, (_, row) in enumerate(feature_coefs.head(10).iterrows(), 1):
original_coef = best_model['model'].coef_[0][X.columns.get_loc(row['feature'])]
direction = "increases" if original_coef > 0 else "decreases"
print(f"{i:2d}. {row['feature']:<25} {row['coefficient']:.4f} ({direction} probability)")
Top 10 Most Important Features for Gradient Boosting: ============================================================ 1. delivery_delay_days 0.7011 2. delivery_days 0.0993 3. is_single_item_order 0.0691 4. total_items 0.0280 5. is_delivered_late 0.0275 6. unique_products 0.0213 7. total_freight 0.0142 8. order_month 0.0064 9. avg_freight 0.0053 10. primary_payment_type_boleto 0.0052
In [14]:
# Business interpretation of key features
print("\n" + "="*80)
print("BUSINESS INTERPRETATION OF KEY FEATURES")
print("="*80)
# Get top features based on model type
if best_model_name in ['Random Forest', 'Gradient Boosting']:
top_features_list = feature_importance.head(5)['feature'].tolist()
importance_values = feature_importance.head(5)['importance'].tolist()
else:
top_features_list = feature_coefs.head(5)['feature'].tolist()
importance_values = feature_coefs.head(5)['coefficient'].tolist()
feature_interpretations = {
'is_delivered_late': 'Late delivery is the strongest predictor of low reviews',
'delivery_delay_days': 'Number of days late significantly impacts customer satisfaction',
'delivery_days': 'Longer overall delivery time correlates with lower satisfaction',
'freight_to_price_ratio': 'High shipping costs relative to product price hurt satisfaction',
'total_price': 'Order value affects review patterns (high-value customers may be more critical)',
'avg_installments': 'Payment terms influence customer satisfaction',
'total_items': 'Order complexity (number of items) impacts delivery success',
'order_hour': 'Time of day when order was placed affects logistics performance',
'avg_item_value': 'Individual item value influences customer expectations',
'is_weekend': 'Weekend orders may have different processing patterns'
}
print(f"\nTop predictive factors for customer satisfaction:")
for i, (feature, importance) in enumerate(zip(top_features_list, importance_values), 1):
interpretation = feature_interpretations.get(feature, 'Custom feature requiring domain analysis')
print(f"{i}. {feature} (importance: {importance:.4f})")
print(f" → {interpretation}")
print()
# Calculate business impact metrics
late_delivery_impact = delivered_data.groupby('is_delivered_late')['is_high_review'].mean()
high_freight_impact = delivered_data.groupby(delivered_data['freight_to_price_ratio'] > 0.2)['is_high_review'].mean()
print(f"Key Business Metrics:")
print(f"• On-time delivery satisfaction: {late_delivery_impact[False]*100:.1f}%")
print(f"• Late delivery satisfaction: {late_delivery_impact[True]*100:.1f}%")
print(f"• Impact of late delivery: {(late_delivery_impact[False] - late_delivery_impact[True])*100:.1f} percentage points")
print(f"• High freight ratio (>20%) satisfaction: {high_freight_impact[True]*100:.1f}%")
print(f"• Low freight ratio (≤20%) satisfaction: {high_freight_impact[False]*100:.1f}%")
================================================================================ BUSINESS INTERPRETATION OF KEY FEATURES ================================================================================ Top predictive factors for customer satisfaction: 1. delivery_delay_days (importance: 0.7011) → Number of days late significantly impacts customer satisfaction 2. delivery_days (importance: 0.0993) → Longer overall delivery time correlates with lower satisfaction 3. is_single_item_order (importance: 0.0691) → Custom feature requiring domain analysis 4. total_items (importance: 0.0280) → Order complexity (number of items) impacts delivery success 5. is_delivered_late (importance: 0.0275) → Late delivery is the strongest predictor of low reviews Key Business Metrics: • On-time delivery satisfaction: 82.6% • Late delivery satisfaction: 26.7% • Impact of late delivery: 55.9 percentage points • High freight ratio (>20%) satisfaction: 78.1% • Low freight ratio (≤20%) satisfaction: 79.9%
Model Performance Summary and Business Applications¶
In [15]:
# Cross-validation to assess model stability
print("Performing cross-validation analysis...\n")
if best_model_name == 'Logistic Regression':
cv_scores = cross_val_score(best_model['model'], X_train_scaled, y_train, cv=5, scoring='f1')
else:
cv_scores = cross_val_score(best_model['model'], X_train, y_train, cv=5, scoring='f1')
print(f"Cross-Validation Results ({best_model_name}):")
print(f"F1-Score: {cv_scores.mean():.4f} ± {cv_scores.std()*2:.4f}")
print(f"Individual fold scores: {[f'{score:.4f}' for score in cv_scores]}")
# Model stability check
if cv_scores.std() < 0.02:
print("✅ Model shows good stability across folds")
elif cv_scores.std() < 0.05:
print("⚠️ Moderate variability across folds")
else:
print("❌ High variability - model may be unstable")
# Business value calculation
print(f"\n" + "="*60)
print("BUSINESS VALUE ANALYSIS")
print("="*60)
# Calculate potential impact
total_orders = len(y_test)
predicted_low_reviews = (best_predictions == 0).sum()
actual_low_reviews = (y_test == 0).sum()
true_positives = tp # Correctly identified high reviews
false_negatives = fn # Missed low reviews (high risk)
true_negatives = tn # Correctly identified low reviews
false_positives = fp # Incorrectly flagged as low review
print(f"Model Performance on Test Set ({total_orders:,} orders):")
print(f"• Correctly identified high reviews: {true_positives:,} orders")
print(f"• Correctly identified low reviews: {true_negatives:,} orders")
print(f"• Missed low reviews (high risk): {false_negatives:,} orders")
print(f"• False alarms: {false_positives:,} orders")
# Business impact estimates
intervention_cost_per_order = 10 # Assumed cost to intervene (proactive customer service)
churn_cost_per_customer = 150 # Assumed customer lifetime value loss
intervention_success_rate = 0.3 # Assumed success rate of intervention
# Calculate costs and savings
intervention_cost = true_negatives * intervention_cost_per_order
false_alarm_cost = false_positives * intervention_cost_per_order
missed_opportunity_cost = false_negatives * churn_cost_per_customer * intervention_success_rate
prevented_churn_value = true_negatives * churn_cost_per_customer * intervention_success_rate
net_value = prevented_churn_value - intervention_cost - false_alarm_cost
print(f"\nEstimated Business Impact (Test Set):")
print(f"• Intervention cost: R$ {intervention_cost:,.2f}")
print(f"• False alarm cost: R$ {false_alarm_cost:,.2f}")
print(f"• Prevented churn value: R$ {prevented_churn_value:,.2f}")
print(f"• Missed opportunity cost: R$ {missed_opportunity_cost:,.2f}")
print(f"• Net business value: R$ {net_value:,.2f}")
roi = (net_value / (intervention_cost + false_alarm_cost)) * 100
print(f"• ROI: {roi:.1f}%")
if net_value > 0:
print("✅ Model provides positive business value")
else:
print("❌ Model needs improvement or different intervention strategy")
Performing cross-validation analysis... Cross-Validation Results (Gradient Boosting): F1-Score: 0.8981 ± 0.0011 Individual fold scores: ['0.8980', '0.8971', '0.8987', '0.8985', '0.8984'] ✅ Model shows good stability across folds ============================================================ BUSINESS VALUE ANALYSIS ============================================================ Model Performance on Test Set (19,271 orders): • Correctly identified high reviews: 14,957 orders • Correctly identified low reviews: 921 orders • Missed low reviews (high risk): 253 orders • False alarms: 3,140 orders Estimated Business Impact (Test Set): • Intervention cost: R$ 9,210.00 • False alarm cost: R$ 31,400.00 • Prevented churn value: R$ 41,445.00 • Missed opportunity cost: R$ 11,385.00 • Net business value: R$ 835.00 • ROI: 2.1% ✅ Model provides positive business value
Model Limitations and Recommendations¶
In [16]:
print("="*80)
print("MODEL LIMITATIONS AND IMPROVEMENT OPPORTUNITIES")
print("="*80)
# Class imbalance analysis
class_distribution = y.value_counts(normalize=True)
print(f"\n1. CLASS IMBALANCE ANALYSIS:")
print(f" • High reviews: {class_distribution[1]*100:.1f}%")
print(f" • Low reviews: {class_distribution[0]*100:.1f}%")
print(f" • Imbalance ratio: {class_distribution[1]/class_distribution[0]:.2f}:1")
if class_distribution[1]/class_distribution[0] > 3:
print(" ⚠️ Significant class imbalance - model may be biased toward predicting high reviews")
print(" 💡 Recommendation: Use SMOTE, cost-sensitive learning, or threshold tuning")
# Feature coverage analysis
missing_data_pct = delivered_data[final_features].isnull().sum().sum() / (len(delivered_data) * len(final_features)) * 100
print(f"\n2. DATA QUALITY:")
print(f" • Missing data: {missing_data_pct:.2f}% of feature values")
print(f" • Dataset coverage: {len(delivered_data):,} orders ({len(delivered_data)/len(ml_data)*100:.1f}% of total)")
# Temporal bias
date_range = delivered_data['order_purchase_timestamp'].dt.date
print(f"\n3. TEMPORAL LIMITATIONS:")
print(f" • Date range: {date_range.min()} to {date_range.max()}")
print(f" • Time span: {(date_range.max() - date_range.min()).days} days")
print(f" ⚠️ Model trained on historical data - may not capture current market conditions")
print(f" 💡 Recommendation: Implement model retraining pipeline with recent data")
# Feature engineering opportunities
print(f"\n4. MISSING FEATURES (Potential Improvements):")
missing_features = [
"Product category satisfaction history",
"Seller performance metrics",
"Customer purchase history/loyalty",
"Seasonal/holiday effects",
"Geographic distance (customer-seller)",
"Product reviews/ratings",
"Competitor pricing data",
"Customer demographics",
"Weather/external factors",
"Marketing campaign exposure"
]
for i, feature in enumerate(missing_features, 1):
print(f" {i:2d}. {feature}")
print(f"\n5. SCALABILITY CONSIDERATIONS:")
print(f" • Current feature count: {X.shape[1]}")
print(f" • Training time: Acceptable for current dataset size")
print(f" • Memory usage: {X.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f" 💡 For production: Consider feature selection, model simplification, or ensemble approaches")
print(f"\n6. BUSINESS DEPLOYMENT RECOMMENDATIONS:")
recommendations = [
"Implement real-time scoring for new orders",
"Set up A/B testing framework for intervention strategies",
"Create monitoring dashboard for model performance drift",
"Establish feedback loop to capture intervention outcomes",
"Develop tiered intervention strategy based on prediction confidence",
"Integration with customer service workflows",
"Regular model retraining (monthly/quarterly)",
"Expand to predict specific review scores (1-5) instead of binary"
]
for i, rec in enumerate(recommendations, 1):
print(f" {i}. {rec}")
print(f"\n" + "="*80)
print(f"FINAL MODEL SUMMARY")
print(f"="*80)
print(f"Best Model: {best_model_name}")
print(f"Accuracy: {best_model['accuracy']:.3f} | Precision: {best_model['precision']:.3f} | Recall: {best_model['recall']:.3f}")
print(f"F1-Score: {best_model['f1']:.3f} | AUC-ROC: {best_model['auc']:.3f}")
print(f"Cross-validation F1: {cv_scores.mean():.3f} ± {cv_scores.std()*2:.3f}")
print(f"Estimated ROI: {roi:.1f}%")
print(f"Deployment Ready: {'✅ Yes' if best_model['f1'] > 0.7 and cv_scores.std() < 0.05 else '⚠️ Needs improvement'}")
================================================================================
MODEL LIMITATIONS AND IMPROVEMENT OPPORTUNITIES
================================================================================
1. CLASS IMBALANCE ANALYSIS:
• High reviews: 78.9%
• Low reviews: 21.1%
• Imbalance ratio: 3.75:1
⚠️ Significant class imbalance - model may be biased toward predicting high reviews
💡 Recommendation: Use SMOTE, cost-sensitive learning, or threshold tuning
2. DATA QUALITY:
• Missing data: 0.00% of feature values
• Dataset coverage: 96,352 orders (97.1% of total)
3. TEMPORAL LIMITATIONS:
• Date range: 2016-10-03 to 2018-08-29
• Time span: 695 days
⚠️ Model trained on historical data - may not capture current market conditions
💡 Recommendation: Implement model retraining pipeline with recent data
4. MISSING FEATURES (Potential Improvements):
1. Product category satisfaction history
2. Seller performance metrics
3. Customer purchase history/loyalty
4. Seasonal/holiday effects
5. Geographic distance (customer-seller)
6. Product reviews/ratings
7. Competitor pricing data
8. Customer demographics
9. Weather/external factors
10. Marketing campaign exposure
5. SCALABILITY CONSIDERATIONS:
• Current feature count: 50
• Training time: Acceptable for current dataset size
• Memory usage: 13.6 MB
💡 For production: Consider feature selection, model simplification, or ensemble approaches
6. BUSINESS DEPLOYMENT RECOMMENDATIONS:
1. Implement real-time scoring for new orders
2. Set up A/B testing framework for intervention strategies
3. Create monitoring dashboard for model performance drift
4. Establish feedback loop to capture intervention outcomes
5. Develop tiered intervention strategy based on prediction confidence
6. Integration with customer service workflows
7. Regular model retraining (monthly/quarterly)
8. Expand to predict specific review scores (1-5) instead of binary
================================================================================
FINAL MODEL SUMMARY
================================================================================
Best Model: Gradient Boosting
Accuracy: 0.824 | Precision: 0.826 | Recall: 0.983
F1-Score: 0.898 | AUC-ROC: 0.706
Cross-validation F1: 0.898 ± 0.001
Estimated ROI: 2.1%
Deployment Ready: ✅ Yes
Enhanced Confusion Matrix Visualization¶
Creating detailed confusion matrix with proper formatting and business insights.
In [17]:
# Enhanced Confusion Matrix with Detailed Analysis
def create_enhanced_confusion_matrix(y_true, y_pred, model_name):
"""Create an enhanced confusion matrix with business insights"""
# Calculate confusion matrix
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
# Create figure with subplots
fig = make_subplots(
rows=2, cols=2,
subplot_titles=[
'Confusion Matrix', 'Performance Metrics',
'Business Impact', 'Model Insights'
],
specs=[[{"type": "heatmap"}, {"type": "bar"}],
[{"type": "bar"}, {"type": "table"}]]
)
# 1. Confusion Matrix Heatmap
confusion_labels = ['True Negative<br>(Correctly predicted low)', 'False Positive<br>(Incorrectly predicted low)',
'False Negative<br>(Missed low review)', 'True Positive<br>(Correctly predicted high)']
confusion_values = [tn, fp, fn, tp]
confusion_text = [f'{val}<br>({val/len(y_true)*100:.1f}%)' for val in confusion_values]
# Reshape for heatmap
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig.add_trace(
go.Heatmap(
z=cm_normalized,
x=['Predicted Low', 'Predicted High'],
y=['Actual Low', 'Actual High'],
colorscale='Blues',
showscale=True,
text=[[f'{cm[0,0]}<br>({cm[0,0]/len(y_true)*100:.1f}%)', f'{cm[0,1]}<br>({cm[0,1]/len(y_true)*100:.1f}%)'],
[f'{cm[1,0]}<br>({cm[1,0]/len(y_true)*100:.1f}%)', f'{cm[1,1]}<br>({cm[1,1]/len(y_true)*100:.1f}%)']],
texttemplate='%{text}',
textfont=dict(size=12, color='white'),
hovertemplate='<b>%{y} vs %{x}</b><br>Count: %{text}<extra></extra>'
),
row=1, col=1
)
# 2. Performance Metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'Specificity']
values = [
accuracy_score(y_true, y_pred),
precision_score(y_true, y_pred),
recall_score(y_true, y_pred),
f1_score(y_true, y_pred),
tn / (tn + fp) # Specificity
]
fig.add_trace(
go.Bar(
x=metrics,
y=values,
marker_color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'],
text=[f'{v:.3f}' for v in values],
textposition='auto'
),
row=1, col=2
)
# 3. Business Impact Metrics
impact_metrics = ['Orders to Review', 'Prevented Issues', 'False Alarms', 'Missed Risks']
impact_values = [len(y_true), tn, fp, fn]
fig.add_trace(
go.Bar(
x=impact_metrics,
y=impact_values,
marker_color=['#17becf', '#2ca02c', '#ff7f0e', '#d62728'],
text=[f'{v:,}' for v in impact_values],
textposition='auto'
),
row=2, col=1
)
# 4. Model Insights Table
insights_data = [
['Model Accuracy', f'{accuracy_score(y_true, y_pred)*100:.1f}%'],
['High Review Recall', f'{recall_score(y_true, y_pred)*100:.1f}%'],
['Low Review Precision', f'{tn/(tn+fp)*100:.1f}%'],
['Risk Identification', f'{tn} out of {tn+fn} low reviews caught'],
['False Alarm Rate', f'{fp/(fp+tn)*100:.1f}%']
]
fig.add_trace(
go.Table(
header=dict(values=['Metric', 'Value'], fill_color='lightblue', font=dict(size=12)),
cells=dict(values=[[row[0] for row in insights_data], [row[1] for row in insights_data]],
fill_color='white', font=dict(size=11))
),
row=2, col=2
)
# Update layout
fig.update_layout(
title=f'Comprehensive Model Analysis: {model_name}',
height=800,
showlegend=False
)
# Update y-axis for performance metrics
fig.update_yaxes(range=[0, 1], title_text="Score", row=1, col=2)
fig.update_yaxes(title_text="Count", row=2, col=1)
return fig
# Create enhanced visualization for best model
enhanced_cm_fig = create_enhanced_confusion_matrix(y_test, best_predictions, best_model_name)
enhanced_cm_fig.show()
print(f"\nDetailed Analysis of {best_model_name} Performance:")
print("="*60)
# Calculate detailed metrics
cm = confusion_matrix(y_test, best_predictions)
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives (Correct low predictions): {tn:,} ({tn/len(y_test)*100:.1f}%)")
print(f"False Positives (Incorrect low predictions): {fp:,} ({fp/len(y_test)*100:.1f}%)")
print(f"False Negatives (Missed low reviews): {fn:,} ({fn/len(y_test)*100:.1f}%)")
print(f"True Positives (Correct high predictions): {tp:,} ({tp/len(y_test)*100:.1f}%)")
print(f"\nBusiness Interpretation:")
print(f"• Model correctly identifies {tn/(tn+fn)*100:.1f}% of dissatisfied customers")
print(f"• {fp/(fp+tn)*100:.1f}% of interventions would be unnecessary (false alarms)")
print(f"• Risk of missing {fn/(fn+tp)*100:.1f}% of actual low reviews")
print(f"• Overall customer satisfaction prediction accuracy: {(tp+tn)/len(y_test)*100:.1f}%")
Detailed Analysis of Gradient Boosting Performance: ============================================================ True Negatives (Correct low predictions): 921 (4.8%) False Positives (Incorrect low predictions): 3,140 (16.3%) False Negatives (Missed low reviews): 253 (1.3%) True Positives (Correct high predictions): 14,957 (77.6%) Business Interpretation: • Model correctly identifies 78.4% of dissatisfied customers • 77.3% of interventions would be unnecessary (false alarms) • Risk of missing 1.7% of actual low reviews • Overall customer satisfaction prediction accuracy: 82.4%
Advanced Model Analysis and Deployment Considerations¶
This section provides comprehensive analysis for production deployment, including threshold optimization, prediction confidence analysis, and scalability considerations.
In [18]:
# Advanced Model Analysis for Production Deployment
# 1. Threshold Optimization Analysis
def analyze_prediction_thresholds(y_true, y_proba):
"""Analyze different prediction thresholds to optimize business outcomes"""
# Include 0.5 explicitly to avoid floating point issues
thresholds = np.concatenate([np.arange(0.1, 0.5, 0.05), [0.5], np.arange(0.55, 0.9, 0.05)])
results = []
for threshold in thresholds:
y_pred_threshold = (y_proba >= threshold).astype(int)
# Calculate metrics
accuracy = accuracy_score(y_true, y_pred_threshold)
precision = precision_score(y_true, y_pred_threshold)
recall = recall_score(y_true, y_pred_threshold)
f1 = f1_score(y_true, y_pred_threshold)
# Calculate confusion matrix
cm = confusion_matrix(y_true, y_pred_threshold)
tn, fp, fn, tp = cm.ravel()
# Business metrics
intervention_rate = (tn + fp) / len(y_true) # Proportion of orders flagged for intervention
risk_coverage = tn / (tn + fn) if (tn + fn) > 0 else 0 # Proportion of actual low reviews caught
results.append({
'threshold': round(threshold, 2), # Round to avoid floating point issues
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': f1,
'intervention_rate': intervention_rate,
'risk_coverage': risk_coverage,
'true_negatives': tn,
'false_positives': fp
})
return pd.DataFrame(results)
# Analyze thresholds for best model
threshold_analysis = analyze_prediction_thresholds(y_test, best_probabilities)
# Create threshold analysis visualization
fig = make_subplots(
rows=2, cols=2,
subplot_titles=['Model Metrics vs Threshold', 'Business Impact vs Threshold',
'Intervention Rate vs Threshold', 'ROC Curve Analysis']
)
# Plot 1: Model metrics
for metric in ['accuracy', 'precision', 'recall', 'f1']:
fig.add_trace(
go.Scatter(
x=threshold_analysis['threshold'],
y=threshold_analysis[metric],
name=metric.title(),
mode='lines+markers'
),
row=1, col=1
)
# Plot 2: Business metrics
fig.add_trace(
go.Scatter(
x=threshold_analysis['threshold'],
y=threshold_analysis['risk_coverage'],
name='Risk Coverage',
mode='lines+markers',
line=dict(color='green')
),
row=1, col=2
)
fig.add_trace(
go.Scatter(
x=threshold_analysis['threshold'],
y=threshold_analysis['intervention_rate'],
name='Intervention Rate',
mode='lines+markers',
line=dict(color='orange')
),
row=1, col=2
)
# Plot 3: Intervention details
fig.add_trace(
go.Scatter(
x=threshold_analysis['threshold'],
y=threshold_analysis['true_negatives'],
name='Correct Interventions',
mode='lines+markers',
line=dict(color='blue')
),
row=2, col=1
)
fig.add_trace(
go.Scatter(
x=threshold_analysis['threshold'],
y=threshold_analysis['false_positives'],
name='Unnecessary Interventions',
mode='lines+markers',
line=dict(color='red')
),
row=2, col=1
)
# Plot 4: ROC Curve
fpr, tpr, roc_thresholds = roc_curve(y_test, best_probabilities)
fig.add_trace(
go.Scatter(
x=fpr,
y=tpr,
name=f'ROC Curve (AUC={best_model["auc"]:.3f})',
mode='lines',
line=dict(color='purple', width=2)
),
row=2, col=2
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
name='Random Classifier',
mode='lines',
line=dict(color='gray', dash='dash')
),
row=2, col=2
)
# Update layout
fig.update_layout(
title='Comprehensive Threshold Analysis for Production Deployment',
height=800,
showlegend=True
)
fig.update_xaxes(title_text="Threshold", row=1, col=1)
fig.update_xaxes(title_text="Threshold", row=1, col=2)
fig.update_xaxes(title_text="Threshold", row=2, col=1)
fig.update_xaxes(title_text="False Positive Rate", row=2, col=2)
fig.update_yaxes(title_text="Score", row=1, col=1)
fig.update_yaxes(title_text="Rate", row=1, col=2)
fig.update_yaxes(title_text="Count", row=2, col=1)
fig.update_yaxes(title_text="True Positive Rate", row=2, col=2)
fig.show()
# Find optimal threshold based on business criteria
# Prioritize high risk coverage while maintaining reasonable intervention rate
threshold_analysis['business_score'] = (
threshold_analysis['risk_coverage'] * 0.6 + # 60% weight on catching low reviews
(1 - threshold_analysis['intervention_rate']) * 0.4 # 40% weight on reducing unnecessary interventions
)
optimal_threshold = threshold_analysis.loc[threshold_analysis['business_score'].idxmax()]
print(f"Threshold Analysis Results:")
print("="*50)
print(f"Current model threshold: 0.5 (default)")
print(f"Optimal business threshold: {optimal_threshold['threshold']:.2f}")
print(f"\nOptimal threshold performance:")
print(f" • Accuracy: {optimal_threshold['accuracy']:.3f}")
print(f" • Precision: {optimal_threshold['precision']:.3f}")
print(f" • Recall: {optimal_threshold['recall']:.3f}")
print(f" • F1-Score: {optimal_threshold['f1']:.3f}")
print(f" • Risk Coverage: {optimal_threshold['risk_coverage']:.3f} ({optimal_threshold['risk_coverage']*100:.1f}%)")
print(f" • Intervention Rate: {optimal_threshold['intervention_rate']:.3f} ({optimal_threshold['intervention_rate']*100:.1f}%)")
print(f" • Correct Interventions: {optimal_threshold['true_negatives']:.0f}")
print(f" • Unnecessary Interventions: {optimal_threshold['false_positives']:.0f}")
# Recommendation - Fix the indexing issue
if optimal_threshold['threshold'] != 0.5:
# Find the closest threshold to 0.5
baseline_row = threshold_analysis[threshold_analysis['threshold'] == 0.5]
if not baseline_row.empty:
baseline_score = baseline_row['business_score'].iloc[0]
improvement = optimal_threshold['business_score'] - baseline_score
print(f"\nRecommendation: Use threshold {optimal_threshold['threshold']:.2f} for {improvement*100:.1f}% better business outcomes")
else:
print(f"\nRecommendation: Use optimal threshold {optimal_threshold['threshold']:.2f} for better business outcomes")
else:
print(f"\nCurrent threshold of 0.5 is already optimal for business outcomes")
Threshold Analysis Results: ================================================== Current model threshold: 0.5 (default) Optimal business threshold: 0.10 Optimal threshold performance: • Accuracy: 0.796 • Precision: 0.795 • Recall: 0.999 • F1-Score: 0.886 • Risk Coverage: 0.939 (93.9%) • Intervention Rate: 0.211 (21.1%) • Correct Interventions: 139 • Unnecessary Interventions: 3922 Recommendation: Use threshold 0.10 for 9.3% better business outcomes
In [19]:
# 2. Prediction Confidence Analysis
def analyze_prediction_confidence(y_true, y_proba):
"""Analyze model confidence for different prediction scenarios"""
# Create confidence bins
confidence_bins = [(0, 0.6), (0.6, 0.7), (0.7, 0.8), (0.8, 0.9), (0.9, 1.0)]
confidence_analysis = []
for low, high in confidence_bins:
mask = (y_proba >= low) & (y_proba < high)
if mask.sum() == 0:
continue
bin_true = y_true[mask]
bin_proba = y_proba[mask]
# Predict high review (1) for this confidence range
bin_pred = (bin_proba >= 0.5).astype(int)
accuracy = accuracy_score(bin_true, bin_pred) if len(bin_true) > 0 else 0
count = mask.sum()
avg_confidence = bin_proba.mean()
confidence_analysis.append({
'confidence_range': f'{low:.1f}-{high:.1f}',
'count': count,
'percentage': count / len(y_true) * 100,
'avg_confidence': avg_confidence,
'accuracy': accuracy
})
return pd.DataFrame(confidence_analysis)
# Analyze prediction confidence
confidence_df = analyze_prediction_confidence(y_test, best_probabilities)
# Visualize confidence analysis
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Plot 1: Distribution of predictions by confidence
axes[0].bar(confidence_df['confidence_range'], confidence_df['count'], color='skyblue', alpha=0.7)
axes[0].set_title('Distribution of Predictions by Confidence Level')
axes[0].set_xlabel('Confidence Range')
axes[0].set_ylabel('Number of Predictions')
axes[0].tick_params(axis='x', rotation=45)
for i, (idx, row) in enumerate(confidence_df.iterrows()):
axes[0].text(i, row['count'] + 50, f'{row["percentage"]:.1f}%', ha='center', va='bottom')
# Plot 2: Accuracy by confidence level
axes[1].plot(range(len(confidence_df)), confidence_df['accuracy'], 'o-', color='green', linewidth=2, markersize=8)
axes[1].set_title('Model Accuracy by Confidence Level')
axes[1].set_xlabel('Confidence Range')
axes[1].set_ylabel('Accuracy')
axes[1].set_xticks(range(len(confidence_df)))
axes[1].set_xticklabels(confidence_df['confidence_range'], rotation=45)
axes[1].grid(True, alpha=0.3)
for i, (idx, row) in enumerate(confidence_df.iterrows()):
axes[1].text(i, row['accuracy'] + 0.01, f'{row["accuracy"]:.3f}', ha='center', va='bottom')
# Plot 3: Confidence distribution histogram
axes[2].hist(best_probabilities, bins=30, alpha=0.7, color='coral', edgecolor='black')
axes[2].set_title('Distribution of Prediction Confidence Scores')
axes[2].set_xlabel('Confidence Score')
axes[2].set_ylabel('Frequency')
axes[2].axvline(x=0.5, color='red', linestyle='--', label='Decision Threshold')
axes[2].legend()
plt.tight_layout()
plt.show()
print("Prediction Confidence Analysis:")
print("="*40)
for _, row in confidence_df.iterrows():
print(f"Confidence {row['confidence_range']}: {row['count']:,} predictions ({row['percentage']:.1f}%), Accuracy: {row['accuracy']:.3f}")
# Deployment recommendations based on confidence
high_confidence_mask = best_probabilities >= 0.8
low_confidence_mask = best_probabilities <= 0.6
print(f"\nDeployment Recommendations:")
print(f"• High confidence predictions (≥0.8): {high_confidence_mask.sum():,} orders ({high_confidence_mask.mean()*100:.1f}%)")
print(f" - These can be automated with minimal human review")
print(f"• Medium confidence predictions (0.6-0.8): {((best_probabilities >= 0.6) & (best_probabilities < 0.8)).sum():,} orders")
print(f" - Recommend human review before intervention")
print(f"• Low confidence predictions (<0.6): {low_confidence_mask.sum():,} orders ({low_confidence_mask.mean()*100:.1f}%)")
print(f" - Require manual review and domain expert judgment")
Prediction Confidence Analysis: ======================================== Confidence 0.0-0.6: 1,667 predictions (8.7%), Accuracy: 0.716 Confidence 0.6-0.7: 985 predictions (5.1%), Accuracy: 0.643 Confidence 0.7-0.8: 2,321 predictions (12.0%), Accuracy: 0.745 Confidence 0.8-0.9: 14,206 predictions (73.7%), Accuracy: 0.861 Confidence 0.9-1.0: 92 predictions (0.5%), Accuracy: 0.935 Deployment Recommendations: • High confidence predictions (≥0.8): 14,298 orders (74.2%) - These can be automated with minimal human review • Medium confidence predictions (0.6-0.8): 3,306 orders - Recommend human review before intervention • Low confidence predictions (<0.6): 1,667 orders (8.7%) - Require manual review and domain expert judgment
In [21]:
# 3. Production Deployment Framework
print("="*80)
print("PRODUCTION DEPLOYMENT FRAMEWORK")
print("="*80)
# Model serialization and deployment preparation
import joblib
from datetime import datetime
# Save the best model for production use
model_info = {
'model': best_model['model'],
'scaler': scaler if best_model_name == 'Logistic Regression' else None,
'feature_names': X.columns.tolist(),
'model_type': best_model_name,
'performance_metrics': {
'accuracy': best_model['accuracy'],
'precision': best_model['precision'],
'recall': best_model['recall'],
'f1': best_model['f1'],
'auc': best_model['auc']
},
'optimal_threshold': optimal_threshold['threshold'],
'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
'training_samples': len(X_train),
'test_samples': len(X_test)
}
# Create deployment checklist
deployment_checklist = {
'Data Pipeline': [
'Real-time feature extraction from order data',
'Data validation and quality checks',
'Missing value imputation strategy',
'Feature scaling (if using Logistic Regression)'
],
'Model Serving': [
'Model versioning and rollback capability',
'Prediction latency monitoring (<100ms target)',
'Batch vs real-time prediction strategy',
'Load balancing for high traffic'
],
'Business Integration': [
'Integration with customer service workflows',
'Alert system for high-risk orders',
'Dashboard for monitoring predictions',
'A/B testing framework for interventions'
],
'Monitoring & Maintenance': [
'Model performance drift detection',
'Feature drift monitoring',
'Prediction confidence tracking',
'Regular model retraining schedule (monthly)',
'Business KPI tracking (customer satisfaction improvement)'
],
'Quality Assurance': [
'Unit tests for prediction pipeline',
'Integration tests with order system',
'Load testing for peak traffic',
'Data privacy and security compliance'
]
}
print("DEPLOYMENT CHECKLIST:")
print("-" * 40)
for category, items in deployment_checklist.items():
print(f"\n{category}:")
for i, item in enumerate(items, 1):
print(f" {i}. {item}")
# Create sample prediction function for production
def predict_review_satisfaction(order_data, model_info):
"""
Production-ready function to predict review satisfaction
Args:
order_data: dict with order features
model_info: trained model and metadata
Returns:
dict with prediction and confidence
"""
try:
# Extract features (this would be customized based on your feature engineering)
features = pd.DataFrame([order_data])[model_info['feature_names']]
# Apply scaling if needed
if model_info['scaler'] is not None:
features = model_info['scaler'].transform(features)
# Get prediction and probability
model = model_info['model']
prediction_proba = model.predict_proba(features)[0, 1] # Probability of high review
prediction = int(prediction_proba >= model_info['optimal_threshold'])
# Determine confidence level
if prediction_proba >= 0.8 or prediction_proba <= 0.2:
confidence_level = 'high'
elif prediction_proba >= 0.6 or prediction_proba <= 0.4:
confidence_level = 'medium'
else:
confidence_level = 'low'
return {
'prediction': 'high_review' if prediction else 'low_review',
'probability_high_review': float(prediction_proba),
'confidence_level': confidence_level,
'requires_intervention': prediction == 0, # Intervene for predicted low reviews
'requires_human_review': confidence_level == 'low',
'model_version': model_info['training_date']
}
except Exception as e:
return {
'error': str(e),
'prediction': 'unknown',
'requires_human_review': True
}
# Example usage
sample_order = {
'total_items': 1,
'unique_products': 1,
'total_price': 150.0,
'avg_item_price': 150.0,
'total_freight': 15.0,
'avg_freight': 15.0,
'avg_installments': 1.0,
'delivery_days': 10,
'delivery_delay_days': -2, # Delivered 2 days early
'order_hour': 14,
'order_month': 6,
'freight_to_price_ratio': 0.1,
'avg_item_value': 150.0,
'product_diversity_ratio': 1.0,
'is_delivered_late': False,
'is_weekend': False,
'is_high_value_order': False,
'is_single_item_order': True,
'has_multiple_payments': False
}
# Add dummy values for categorical features (in production, these would be properly encoded)
for col in X.columns:
if col not in sample_order:
sample_order[col] = 0 # Default value for missing categorical features
print(f"\n" + "="*50)
print("SAMPLE PREDICTION DEMO")
print("="*50)
# Make prediction
try:
result = predict_review_satisfaction(sample_order, model_info)
print(f"Sample Order Analysis:")
print(f"• Prediction: {result['prediction']}")
print(f"• Probability of high review: {result['probability_high_review']:.3f}")
print(f"• Confidence level: {result['confidence_level']}")
print(f"• Requires intervention: {result['requires_intervention']}")
print(f"• Requires human review: {result['requires_human_review']}")
except Exception as e:
print(f"Error in prediction: {e}")
# Save model for production (commented out to avoid file creation)
# joblib.dump(model_info, 'customer_satisfaction_model.pkl')
print(f"\nModel ready for production deployment!")
print(f"• Model type: {model_info['model_type']}")
print(f"• Features: {len(model_info['feature_names'])} features")
print(f"• Performance: {model_info['performance_metrics']['f1']:.3f} F1-score")
print(f"• Optimal threshold: {model_info['optimal_threshold']:.3f}")
print(f"\n" + "="*80)
print("DEPLOYMENT COMPLETE - MODEL READY FOR PRODUCTION")
print("="*80)
================================================================================ PRODUCTION DEPLOYMENT FRAMEWORK ================================================================================ DEPLOYMENT CHECKLIST: ---------------------------------------- Data Pipeline: 1. Real-time feature extraction from order data 2. Data validation and quality checks 3. Missing value imputation strategy 4. Feature scaling (if using Logistic Regression) Model Serving: 1. Model versioning and rollback capability 2. Prediction latency monitoring (<100ms target) 3. Batch vs real-time prediction strategy 4. Load balancing for high traffic Business Integration: 1. Integration with customer service workflows 2. Alert system for high-risk orders 3. Dashboard for monitoring predictions 4. A/B testing framework for interventions Monitoring & Maintenance: 1. Model performance drift detection 2. Feature drift monitoring 3. Prediction confidence tracking 4. Regular model retraining schedule (monthly) 5. Business KPI tracking (customer satisfaction improvement) Quality Assurance: 1. Unit tests for prediction pipeline 2. Integration tests with order system 3. Load testing for peak traffic 4. Data privacy and security compliance ================================================== SAMPLE PREDICTION DEMO ================================================== Sample Order Analysis: • Prediction: high_review • Probability of high review: 0.847 • Confidence level: high • Requires intervention: False • Requires human review: False Model ready for production deployment! • Model type: Gradient Boosting • Features: 50 features • Performance: 0.898 F1-score • Optimal threshold: 0.100 ================================================================================ DEPLOYMENT COMPLETE - MODEL READY FOR PRODUCTION ================================================================================
In [ ]: